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Abstract —While the HPC community is working towards the 
development of the first Exaflop computer (expected around 
2020), after reaching the Petaflop milestone in 2008 still only 
few HPC applications are able to fully exploit the capabilities 
of Petaflop systems. In this paper we argue that efforts for 
preparing HPC applications for Exascale should start before 
such systems become available. We identify challenges that 
need to be addressed and recommend solutions in key areas 
of interest, including formal modeling, static analysis and op¬ 
timization, runtime analysis and optimization, and autonomic 
computing. Furthermore, we outline a conceptual framework 
for porting HPC applications to future Exascale computing 
systems and propose steps for its implementation. 

I. Introduction 

Exascale computing Q is expected to revolutionize com¬ 
putational science and engineering by providing lOOOx the 
capabilities of currently available computing systems, while 
having a similar power footprint. The total performance of 
the 500 systems in the 44th TOP500 list (18 Nov 2014, 
http://top500.org/) is about 0.3 exaFLOPS. The HPC com¬ 
munity 0 is now working towards the development of the 
first Exaflop computer, expected around 2020, after reaching 
the Petaflop milestone in 2008. However, only a few HPC 
applications are so far able to fully exploit the capabilities 
of Petaflop systems 0. Examples of typical scalability for 
commonly used HPC applications in our organizations are 
provided in Table |T| As the existing HPC applications are the 
major HPC asset, it is important and challenging to increase 
their scalability and lifetime by making them Exascale-ready 
before 2020. 

The major challenge for preparing HPC applications for 
Exascale is that there is no Exascale system available 
yet. Currently all we have are assumptions about Exascale 
systems. Therefore the commonly used measurement-based 
approaches for reasoning about performance issues are not 
applicable. Pre-exascale systems (known as Summit and 
Sierra) that IBM El is developing for the U.S. Department 
of Energy will exceed 100 petaflops and may provide hints 
about the extreme-scale architectures of the future. 


This paper argues that efforts for preparing HPC ap¬ 
plications for Exascale should start before such systems 
become available. We identify challenges that need to be 
addressed and recommend solutions in areas that are relevant 
for porting HPC application to future Exascale computing 
systems, including formal modeling, static analysis and opti¬ 
mization, runtime analysis and optimization, and autonomic 
computing. 

We suggest that porting of HPC applications should be 
made by successive, stepwise improvements based on the 
currently available assumptions and data about Exascale sys¬ 
tems. This approach should support application improvement 
each time new information about future Exascale systems 
becomes available, including the time when the application 
is actually deployed and runs on a concrete Exascale system. 
A high-level application representation that captures key 
functional and non-functional properties in conjunction with 
the abstract machine model will enable programmers and 
tools to reason about and perform application improve¬ 
ments, and will serve as input to runtime systems to handle 
performance and energy optimizations and self-aware fault 
management. A tunable abstract machine model encapsu¬ 
lates current assumptions for future Exascale systems and 
enables a priori application improvement before the concrete 
execution platform is known. At runtime, the model is 
a posteriori tuned to support activities such as feedback- 
oriented code improvement or dynamic optimization. 

Major contributions of this paper include, 

• identification of challenges and recommendation of 
solutions in formal modeling (Section |Il-A| ), static anal¬ 
ysis and optimization (Section TO runtime analysis 
and optimization (Section |Il-C| ), autonomic computing 
(Section |II-D| ); 

• a conceptual framework for preparing HPC applica¬ 
tions for Exascale that supports a priori application 
improvements before the concrete execution platform is 
known as well as a posteriori optimization at runtime 
(Section |III-A| ); 

• a discussion of the related work (Section |TV|). 











Table I 

Typical current scalability (in processor cores) of commonly used HPC applications in our organizations. 


Code 

Application Domain 

Language 

Scalability 

WIEN2k 

Materials Science 

F90 

1024 

SIMONA 

Nano Science 

C++ 

16384 

ECHAM/MESSy 

Environmental Science 

F77/F90 

1000 

CORSIKA 

Astroparticle Physics 

F77/F90 

2500 

OpenFOAM 

Computational fluid dynamics 

C++ 

16384 

IBM Watson 

Graph Analytics 

C++ 

32768 

Bifrost 

Stellar atmosphere simulation 

F90 

6500 


II. Challenges and Recommendations 

In this section we identify challenges and recommend so¬ 
lutions in formal modeling, static analysis and optimization, 
runtime analysis and optimization, and autonomic comput¬ 
ing. 

A. Formal modeling 

Our goal is to adapt HPC application code to Exascale 
execution platforms to achieve good utilization of resources. 
For this, we need to address questions such as: 

1) What would happen if we change application or hard¬ 
ware layout? 

2) What would happen if we change some parameters of 
the execution platform? 

3) What would happen if we use a different execution 
platform? 

Unfortunately, answering these questions cannot be done 
experimentally at the concrete level because such platforms 
do not yet exist. An alternative is to address these questions 
at an abstract level, focusing only on relevant information 
without actually executing the program. 

We believe that relevant information in this context is not 
what the code aims to achieve (the result of the computation) 
but its corresponding resource footprints , that is, how com¬ 
putational tasks communicate and synchronize, the amount 
of resources (such as memory and computing time) these 
tasks require, and how they access and move data. 

In order to adapt the HPC code to a particular architecture 
we need to capture such resource footprints of software 
modules at different levels of granularity (e.g., program 
statements, blocks in procedure bodies and whole proce¬ 
dures), and be able to compare different task composi¬ 
tions. Consequently, the modeling language must feature 
massively parallel operators over such task-level resource 
footprints Q. A similar notion of resource footprints and 
composition can be used to express the properties of the 
architecture in a machine model to capture the resources 
that the architecture can make available to the code. 

Working with resource footprints can be supported by 
an abstract behavioral specification language 0 , in which 
models describe both tasks and deployments. These models 
can be used to predict the non-functional behavior of code 


before it is deployed, and to compare deployments using 
formal methods. This requires a formal semantics for the 
specification language that can be used to devise static 
analysis techniques. 

When developing code from scratch using a model-based 
approach, the resource footprints can be specified in tandem 
with the standard model in a model-driven development 0, 
oo, a, m. However, when building such models from 
existing HPC code, monitoring profiles of low- and medium- 
scale systems can be used to extract resource footprints 
that approximate the resource consumption in terms of 
probabilistic distributions. 

B. Static analysis and optimization 

The application of formal methods to parallel programs 
for analyzing functional properties, such as safety and live¬ 
ness, has a long tradition. For non-functional properties , 
such as execution time and energy consumption, most per¬ 
formance analysis approaches use monitoring and present 
statistical information to the user. These approaches are 
helpful to improve HPC application code, but they also have 
some shortcomings: 

1) Due to non-determinism, different program executions 
might lead to different observations. As a conse¬ 
quence, these methods are not able to provide reliable 
probabilistic information about average or worst-case 
execution times. 

2) They are based on execution on a real platform, 
thus they cannot be used to predict performance on 
Exascale computers, which are not available yet. 

3) These methods can be used to identify execution 
bottlenecks, but they cannot explain the reasons for 
these bottlenecks, and thus they do not offer any 
concrete support for code improvement. 

We expect that formal methods can address these limi¬ 
tations to provide performance analysis tools that consider¬ 
ably go beyond the state-of-the-art. A major step in this 
direction will be the usage of resource footprints which 
describe both HPC applications and execution platforms 
as abstract probabilistic models. Formal analyses can be 
applied to these models to predict their probabilistic be¬ 
havior. While a range of techniques are available for non- 





probabilistic programs, the analysis of parallel probabilis¬ 
tic programs still need development effort. To achieve a 
reasonable balance between scalability and precision for 
challenging HPC applications, it seems fundamental to use 
hybrid approaches D3 that combine techniques such as 
static analysis, dynamic analysis, simulation, (parametric) 
model checking fl2l . counterexample-guided abstraction 
refinement CE3, deductive approaches, etc. 

To deliver the envisaged performance analysis tools, we 
face the following challenges: (1) determining the com¬ 
putation of cost properties that are given by means of 
probabilistic distributions; (2) the inference of average cost 
in addition to the traditional worst-case cost; (3) take into 
account the underlying platform through a set of probabilis¬ 
tic parameters; (4) deal with massive and heterogeneous 
parallelism ifTH . fl5l . (T6l . fiT) which is challenging for 
program analysis in general; and (5) develop multi-objective 
resource usage analyses and optimizations. 


C. Runtime analysis and optimization 

Formal modeling and static analysis should be enhanced 
with analysis of measurements at runtime. Plenty of tools 
(for instance, http://www.vi-hps.org) have been developed 
for performance measurement and analysis of HPC appli¬ 
cations at runtime. However, these tools will experience 
several issues when applied to Exascale. The collection rates 
and the overall volume of monitoring data in an Exascale 
computing environment will exceed the scalability of present 
performance tools. Therefore, throttling the data volume will 
have to be applied online in order to store as less data as 
possible and as much as necessary for later post mortem 
analysis. However, simple profiling will not be sufficient due 
to loss of temporal information, thus a hybrid approach will 
have to be applied that performs on-the-fly trace analysis 
in order to discard irrelevant data, while retaining the same 
amount of information. 

The metric classification should be based on the formal 
model (see Section |IFA| ). Such an approach will provide a 
generic insight into the performance of an HPC application 
that can be used for detecting performance bottlenecks. The 
instrumentation and hardware counter monitoring should 
follow a similar procedure where source code probing should 
be applied automatically by using tools such as OPARI 
ED. While many tools for collecting metrics of computing 
performance have been developed, very few analysis tools 
exist for energy consumption metrics in adequate accuracy 
and time resolution necessary for the runtime performance 
analysis ff9l . (20). 

Currently, common approaches (see for example PRACE 
best practices (21) for optimizing HPC applications require 
per-case inspection of runtime performance measurement 
data, such as profiling and tracing data. After the critical re¬ 
gion has been determined, diverse heuristic approaches, such 
as “ trial and error ”, “educated guess ” or “rule of thumb ”, 


are applied to make changes in the affected source code 
sections. The most significant limitation of these heuristics 
and knowledge-based approaches is that, 

1) all changes are made directly and manually in the 
source code, and 

2) the effect of the changes does not always lead to 
an improved performance which makes necessary the 
repeating of all steps several times. 


Moreover, Exascale computers pose a multi-objective 
optimization problem, weighing out the effects of several 
sometimes incongruent requirements. Therefore, a system¬ 
atic and automatic approach for the optimization problem 
is essential to find the optimal solution. Another problem is 
that the critical section in an application typically changes 
with the optimization iterations and/or with upscaling, due 
to the law of diminishing returns, which makes the manual 
analysis and source code changes even more laborious and 
inefficient, even if done by an experienced HPC developer. 
Thus instrumentation, collection/measurement and analysis 
steps should be automated, for example based on high-level 
scalable tools il22l . EH . ED, and integrated into a feedback 
loop (see Section (TED]). 


D. Autonomic computing 

During the execution of an application, failures may 
occur or the application performance may be below the 
expectation. These issues are addressed typically by pro¬ 
grammers in a “trial and error ” manner, i.e. by manually 
changing and adapting their code to handle the failures 
and improve the performance. Our proposed framework 
(Section III-A| ) provides means for model-based failure 
handling or performance improvement based on autonomic 
computing. Autonomic computing addresses self-managing 
characteristics of distributed computing resources with the 
facilities to adapting to unpredictable changes while hiding 
management complexity to operators and users (25) • Among 
the explored categories, advanced-control based methods 
and more specifically distributed controllers are the first can¬ 
didates to realize autonomic computing in Exascale systems. 

We propose to devise methodologies to efficiently collect 
runtime information balancing the amount and cost for 
storage of monitoring data with the quality of monitored 
data necessary to make deductions about the application 
behavior (e.g. trace analysis). The goal is thereby to define 
methodologies to scale current monitoring tools to Exascale, 
balancing between quality and volume of monitoring data. 

Combining the information from both static code analysis 
and runtime analysis, as outlined above, we will iteratively 
apply objective-oriented transformations to legacy applica¬ 
tion code at a formal level based on the Exascale DSL 
model (see Section |II-A| ). To this end, we will automate 
the analysis, optimization and transformation processes by 
implementing a generic feedback loop independent of the 
concrete programming language, algorithms used and target 











hardware architecture. A feedback loop driver enables to 
link the static analysis tool, the runtime analysis tool, 
the knowledge database and multi-parameter multi-objective 
optimization. As output, a set of rules (policies) is generated 
which is then applied to transform the formal application 
model and to adapt the runtime environment parameters 
(cf. Figure [I]). After the transformations a new application 
executable is built and started in the adapted runtime envi¬ 
ronment. This described loop is iterated until convergence 
of the optimization. 

III. Conceptual Framework and Benefits 

In this section we propose a conceptual framework for 
porting HPC applications to Exascale computing systems. 
Furthermore, we highlight benefits of our conceptual frame¬ 
work in the context of Exascale computing. 

A. Conceptual Framework 

Our proposed approach for preparing HPC applications 
for Exascale is depicted in Figure [T] The usage of a Domain- 
Specific Exascale Language (DSEL) facilitates the program¬ 
mer to express non-functional aspects (like required time to 
solution, resilience or energy-efficiency) of the execution of 
scalable parallel HPC codes. DSEL has a formal operational 
semantics that enables the formal analysis of the code. 
The aim of the Scalable Model-based Analyzer (SMA) is 
to address non-functional properties of HPC codes, with 
a particular focus on scalability while complying with the 
crucial dimensions of resource consumption for Exascale 
computing: time, energy, and resilience. The SMA is re¬ 
sponsible for analyzing resource consumption in terms of 
time, energy, and resilience, based on developed DSELs. 
The Exascale Runtime Data Collector (ERDC) is responsible 
for scalable monitoring to extract important monitoring data 
through the utilization of various techniques like filtering , 
streaming , or data mining. The runtime information is used 
to verify or to tune the model of the code via the Autonomous 
Feedback Loop (AFL). To endow the system with self¬ 
adaption, control-theoretical concepts are incorporated in 
autonomic computing paradigm. Based on the autonomic 
technology for application optimization, programmers will 
be less dependent on the currently used “trial-and-error” 
approach. 

Our approach considers optimization opportunities during 
the application life cycle comprising improvements based 
on static code analysis, deployment-time optimization, and 
run-time optimization. The developed models are used to 
identify the potential for improvement of the scalability 
for HPC applications under study and suggest application 
modifications that may lead to better scalability. 

B. Benefits 

Exascale computing is not simply the continuation of a 
computational capability trend that has been proven true for 


the last five decades. First, while clock rate scaling is limited, 
complex multicore architectures and parallel computing still 
follow Moore’s law. Second, Exascale computing capability 
will finally allow complex real-life simulations and data 
analytics. The latter will greatly expand the horizons of 
scientific discovery and enable the new data-driven economy 
to become a reality. 

However, the Exascale promise faces a series of ob¬ 
stacles, with the most difficult being energy, scalability, 
reliability and programmability. Our proposal is to develop 
a holistic, unifying and mathematically founded framework 
to systematically attack the roots of these problems. That is, 
instead of attacking these problems separately, we propose 
a holistic approach to study them as a multi-parametric 
problem which will allow us to deeply understand their 
interplay and thus make the right decisions to navigate in 
this complex landscape. 

The benefits are targeting the full spectrum of actors and 
beneficiaries. System developers will have a much better 
path to design, while end users and application developers 
will benefit from increased scalability, performance, relia¬ 
bility and programmability. HPC centers will see a great 
increase in overall system usability and an energy budget 
that is affordable. This in turn has the potential to greatly 
limit and contain the overall impact of high end HPC to the 
environment. 

IV. Related Work 

In a prospective analysis of issues with extreme scale sys¬ 
tems (26), the importance of concurrency, energy efficiency 
and resilience of software, as well as software-hardware co¬ 
design has been elucidated. 

Focusing on energy-aware HPC numerical applications, 
the EXA2GREEN project (http://exa2green-project.eu) has 
developed energy-aware performance metrics (27) . as well 
as energy-aware basic algorithm motifs such as linear solvers 
128]]. Further work will strongly benefit from these results. 
Different power measurement interfaces available on current 
architecture generations have been evaluated and the role of 
the sampling rate has been discussed (29) . 

The AutoTune approach (30) employs the Periscope tun¬ 
ing framework ED to automate performance analysis and 
tuning of HPC applications with the goal to improve per¬ 
formance and energy efficiency. Therein, both performance 
analysis and tuning are performed automatically during a 
single run of the application. 

The DEEP project (32! has developed a novel Exascale- 
enabling supercomputing architecture with a matching soft¬ 
ware stack and a set of optimized grand-challenge simulation 
applications. The goal of the DEEP architecture is to enable 
unprecedented scalability and with an extrapolation to mil¬ 
lions of cores to take the DEEP concept to an Exascale level. 
The follow-up DEEPer project (http://www.deep-er.eu) is 
mainly focusing on I/O and resiliency aspects. 
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Figure 1. Our conceptual framework for porting HPC applications to Exascale computing systems 


The CRESTA project (http://www.cresta-project.eu) has 
adopted a co-design strategy for Exascale, including all 
aspects of hardware architectures, system and application 
software. A major asset from the CRESTA project is the 
Score-P measurement system E3 on which an integration 
and automation of performance analysis tools (cf. Sec¬ 
tion |II-Q can be based. In addition, efforts have been made 
on developing a domain-specific language for expressing 
parallel auto-tuning specifications and an adaptive runtime 
support framework. 

V. Summary 

Exascale computing will revolutionize high-performance 
computing, but the first Exascale systems are not expected 
to appear before 2020. In this paper we have argued that 
the effort for preparing HPC applications for Exascale 
should start now. We have proposed that porting of HPC 
applications should be made by successive, stepwise im¬ 
provements based on the currently available assumptions and 
data about Exascale systems. This approach should support 
application improvement each time new information about 
future Exascale systems becomes available, including the 
time when the application is actually deployed and runs on 
a concrete Exascale system. We have identified challenges 
that need to be addressed and recommended solutions in 
key areas of interest for our approach including: formal 
modeling, static analysis and optimization, runtime analysis 
and optimization, and autonomic computing. Our future 
research will address the development of a framework that 
supports the conceptual framework presented in this paper. 
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